Efficient Semantic-Aware Detection of Near Duplicate Resources
نویسندگان
چکیده
Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it for satisfying their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.
منابع مشابه
Consumer photo management and browsing facilitated by near-duplicate detection with feature filtering
Near-duplicate detection techniques are exploited to facilitate representative photo selection and region-of-interest (ROI) determination, which are important functionalities for efficient photo management and browsing. To make near-duplicate detection module resist to noisy features, three filtering approaches, i.e., point-based, region-based, and probabilistic latent semantic (pLSA), are deve...
متن کاملUsing Page Size for Controlling Duplicate Query Results in Semantic Web
Semantic web is a web of future. The Resource Description Framework (RDF) is a language to represent resources in the World Wide Web. When these resources are queried the problem of duplicate query results occurs. The present techniques used hash index comparison to remove duplicate query results. The major drawback of using the hash index to remove duplicate query results is that, if there is ...
متن کاملAn Efficient Approach for Near-duplicate page detection in web crawling
The drastic development of the World Wide Web in the recent times has made the concept of Web Crawling receive remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to the web search engines making their results less relevant to the users. The presence of duplicate and near duplicate web documents in abundance has created additional overhea...
متن کاملSpeed-up Multi-modal Near Duplicate Image Detection
Near-duplicate image detection is a necessary operation to refine image search results for efficient user exploration. The existences of large amounts of near duplicates require fast and accurate automatic near-duplicate detection methods. We have designed a coarse-to-fine near duplicate detection framework to speed-up the process and a multi-modal integration scheme for accurate detection. The...
متن کاملData cleansing base on subgraph comparison
With the quick development of the semantic web technology, RDF data explosion has become a challenging problem. Since RDF data are always from different resources, which may have overlap with each other, they could have duplicates. These duplicates may cause ambiguity and even error in reasoning. However, attentions are seldom paid to this problem. In this paper, we study the problem and give a...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2010